
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models#24982

Merged
tlrmchlsmth merged 39 commits intovllm-project:mainfrom
tlrmchlsmth:tp_attn_fix_more_models
Sep 27, 2025

Conversation

@tlrmchlsmth
Member

@tlrmchlsmth tlrmchlsmth commented Sep 16, 2025

Purpose

Prior to this PR, using TP attention together with EP MoEs (`--tensor-parallel-size N --data-parallel-size M --enable-expert-parallel`) would in many cases result in a factor of N redundant work in the MoE layers.

This PR extends #24134 to other models, and to the `naive` and `allgather_reducescatter` All2All backends.
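The shape of the redundancy (and of the fix) can be sketched in plain Python. This is a simplified, framework-free illustration under the assumption that after TP attention each TP rank holds a full replica of the token sequence; `shard_for_moe` and `allgather` are invented names for this sketch, not vLLM APIs:

```python
def shard_for_moe(tokens: list, tp_rank: int, tp_size: int) -> list:
    """Each TP rank replicates the tokens after TP attention.  Without
    sequence sharding, every rank would dispatch all tokens to the EP MoE,
    doing tp_size times the necessary work.  Sharding gives each rank a
    disjoint contiguous chunk (sequence-parallel MoE dispatch)."""
    chunk = (len(tokens) + tp_size - 1) // tp_size  # ceil division
    return tokens[tp_rank * chunk:(tp_rank + 1) * chunk]


def allgather(shards: list) -> list:
    """Stand-in for the all-gather that reassembles the full sequence
    after the MoE layer."""
    return [tok for shard in shards for tok in shard]


tokens = list(range(10))
shards = [shard_for_moe(tokens, r, 4) for r in range(4)]
# Each token is dispatched exactly once across the 4 TP ranks:
assert allgather(shards) == tokens
```

The point of the sketch is only the bookkeeping: each rank works on a disjoint slice instead of the full replica, so total MoE work is 1x rather than Nx.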

Test Plan

```shell
vllm serve {{MODEL}} -tp 2 -dp 2 --enable-expert-parallel --port 8192
```

```shell
lm_eval --model local-completions --tasks gsm8k \
  --model_args model={{MODEL}},base_url={{BASE_URL}}/v1/completions,num_concurrent=50,max_retries=3,tokenized_requests=False \
  --limit 100
```

Test Result

Qwen/Qwen3-30B-A3B-FP8:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.88|±  |0.0327|
|     |       |strict-match    |     5|exact_match|↑  | 0.94|±  |0.0239|

Qwen/Qwen3-Next-80B-A3B-Instruct (with --enforce-eager due to #25437):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.80|±  |0.0402|
|     |       |strict-match    |     5|exact_match|↑  | 0.74|±  |0.0441|

meta-llama/Llama-4-Scout-17B-16E:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.82|±  |0.0386|
|     |       |strict-match    |     5|exact_match|↑  | 0.82|±  |0.0386|

ibm-granite/granite-4.0-tiny-preview (with --enforce-eager due to #25437 (comment)):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.58|±  |0.0496|
|     |       |strict-match    |     5|exact_match|↑  | 0.55|±  |0.0500|

openai/gpt-oss-20b (main at TP4 is almost the same):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3685|±  |0.0133|
|     |       |strict-match    |     5|exact_match|↑  |0.2365|±  |0.0117|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
@mergify mergify bot added deepseek Related to DeepSeek models qwen Related to Qwen models labels Sep 16, 2025
@mergify
Contributor

mergify bot commented Sep 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 17, 2025
@tlrmchlsmth tlrmchlsmth added this to the v0.11.0 milestone Sep 18, 2025
@mergify mergify bot added llama Related to Llama models speculative-decoding labels Sep 21, 2025
@mergify mergify bot removed the needs-rebase label Sep 21, 2025
Runs but wrong answer in this case

xuechendi pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Sep 30, 2025
After vllm-project/vllm#24982 merged, sequence-parallel MoE is turned on when `enable_expert_parallel=True`, `tp_size > 1`, and `dp_size > 1`. Since Gaudi has no alternative `VLLM_ALL2ALL_BACKEND` to select, we cannot easily bypass it, so this PR adds support for the feature.

```python
class ParallelConfig:

    @property
    def use_sequence_parallel_moe(self) -> bool:
        return (envs.VLLM_ALL2ALL_BACKEND
                in ("allgather_reducescatter", "naive",
                    "deepep_high_throughput", "deepep_low_latency")
                and self.enable_expert_parallel
                and self.tensor_parallel_size > 1
                and self.data_parallel_size > 1)
```
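As a quick sanity check of the gating condition, it can be exercised as a plain function. This is a standalone sketch, not the actual vLLM `ParallelConfig`; the function name and parameters are invented for illustration:

```python
def use_sequence_parallel_moe(backend: str, enable_ep: bool,
                              tp_size: int, dp_size: int) -> bool:
    # Mirrors the property's logic as a free function for illustration.
    return (backend in ("allgather_reducescatter", "naive",
                        "deepep_high_throughput", "deepep_low_latency")
            and enable_ep and tp_size > 1 and dp_size > 1)


assert use_sequence_parallel_moe("naive", True, 2, 2)
assert not use_sequence_parallel_moe("naive", True, 1, 2)   # needs tp_size > 1
assert not use_sequence_parallel_moe("naive", False, 2, 2)  # needs EP enabled
```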

Update:
No hard requirement on vllm-project/vllm#25828

---------

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
iboiko-habana pushed a commit to iboiko-habana/vllm-gaudi that referenced this pull request Oct 2, 2025
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
shyeh25 pushed a commit to shyeh25/vllm that referenced this pull request Oct 14, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

Labels

ci/build, deepseek (Related to DeepSeek models), gpt-oss (Related to GPT-OSS models), llama (Related to Llama models), multi-modality (Related to multi-modality, #4194), qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding

Projects

Status: Done

4 participants